1 Exploring the Public Evidence on Open Access Monographs

Micah Altman, CREOS Research Scientist

[Draft: 12/1/2020]

2 Introduction

There is ongoing tension between the desire of scholars to share their work widely and openly, and the need to fund the infrastructure and labor of publishing. This tension is especially evident in the sale of scholarly monographs. Although they are only a small fraction of scholarly communications by volume, market, and readership, academic monographs continue to play an important role in the humanities and social sciences. They represent an important form of long-form scholarship that is not readily expressible through journal-length publications. As such, monograph publication through a university press forms a critical component of tenure evaluation, sometimes independent of the extent to which the monograph itself is purchased, read, or cited (Eve 2014; Crossick 2016).

A Page from the Oldest Open Monograph

2.1 Economic Pressures on Monograph Publishing

Monograph publication has been in crisis for approximately two decades. Changes in academic library collection policies, driven in part by the serials crisis and the greater integration of purchase-on-demand workflows, have led to traditional monograph publishing becoming generally unprofitable (Crow, n.d.; Spence 2018). At the same time, there is an increasing demand among scholars, research funders, and the public that the outputs of scholarship be made open access (Guédon 2019; Science Europe, n.d.).

There are many potential funding models for open monographs (Penier, Eve, and Grady 2020; Adema, Stone, and Keene, n.d.). Currently, a number of initiatives seek to promote consortial models involving both publishers and groups of libraries. These consortial models include library crowdfunding, membership fees, subscribe-to-open transitions, and the direct funding of shared infrastructure. They coordinate disciplinary communities (usually through libraries as representatives), enable publishers to streamline workflows for open digital publication, and reduce potential cost risk (to publisher and reader).

These initiatives notwithstanding, open access monographs constitute a small fraction of total monograph titles and, in the near future, will likely make up only a few percent of the monograph titles published annually (Grimme et al. 2019).

2.2 Reviewing the Evidence

Open monograph publishing remains in its early stages. The CREOS “The Economics of Scholarly Monographs” project examines this area. This fall, as an initial step, we published an annotated bibliography that serves as a jumping-off point for scholars to explore the effects of open availability on monograph revenues.

In this blog post we look at the open data available on monograph publication and use it to explore patterns and trends in open monograph publishing. The post takes the form of a guided, interactive, reproducible data analysis based on currently available public data.1 We aim for this exploration to inform libraries, publishers, and authors about the landscape, and to help them prepare for future transitions to open access.

3 Accessible Data on Open Monographs

The most complete index of open access monographs is the Directory of Open Access Books (DOAB), which lists tens of thousands of individual monographs (also known as ‘titles’). DOAB makes its metadata index available as open data.

# core libraries for tidy data science in R
library(tidyverse)
library(magrittr)
if (doc_debug) {
  require(tidylog)
}

## the details of data retrieval in a separate module, included in our repository
## mono_load_* loads the named data as a R data frame from cache in github
## mono_fetch_* routines are used to retrieve a new version of data from canonical source

source("fetch_data.R")

## ISBN normalization and retrieval of open descriptive metadata based on 
## these are implemented through the isbntools python module
## we install these and provide a simple R wrapper (based on reticulate)
source("isbntools.R")

if (doc_refresh_data) {
  isbn_tools_init()
  mono_fetch_doab()
  mono_fetch_oapc()
  mono_fetch_hathi()
}
doab_df <- mono_load_doab()
oapc_df <- mono_load_oapc()

The unique identifiers in the DOAB can be used to link it with other data sources. For example, we can use the ISBN as a key to retrieve information from Google Books, such as the cover of the most recently added title:

latest_book_isbns <- doab_df %>%
  ungroup() %>%
  arrange(`Added on date`) %>%
  slice_tail(n = 1) %>%
  ## pull() returns the ISBN field as a character vector rather than a data frame
  pull(`ISBN`) %>%
  str_split(" ") %>%
  unlist()

if (doc_refresh_data) {
  cover_uri <- isbntools("cover",latest_book_isbns[1])[1,]
  # increase zoom level
  cover_uri %<>% str_replace("zoom=5","zoom=10")
  download.file(cover_uri, doc_sample_thumbnail_path )
}

Cover of Latest Monograph, Retrieved from Google Books

The DOAB data also provides links to the text of the open monograph itself. The monograph content is thus potentially available for harvesting, analysis, and integration with other sources. In practice, however, retrieving the content through DOAB may require some additional web scraping, as demonstrated below. For books that are also available in HathiTrust, obtaining the content through its APIs is more reliable and straightforward.

### Capture image of first page of oldest open monograph
library(rvest)

## find the oldest book in DOAB and extract its URL
oldbook_url <- doab_df %>%
  arrange(`Year of publication`) %>%
  head(n = 1L) %>%
  pull(`Full text`) %>%
  as.character()

if (doc_refresh_data) {
  ## retrieve the book's landing page and follow the metadata embedded in it
  require(rvest)
  oldbook_pg <- read_html(oldbook_url)
  pdf_url <- oldbook_pg %>%
    html_nodes(xpath = '//meta[@name="citation_pdf_url"]') %>%
    html_attr("content")

  ## retrieve book and extract first page as image
  require(pdftools)
  pdf_tmpfile <- tempfile(fileext=".pdf")

  download.file(pdf_url, pdf_tmpfile)
  pdf_convert(pdf_tmpfile, page = 1, dpi = 300, file = doc_sample_image_path)
}

Two other data sources are designed to provide additional information specifically about open access monograph titles:

  • The OpenAPC project provides title-level data on processing charges, supplied by a number of consortial initiatives.

  • OpenBookPublishers provides title-level usage data on the titles it publishes.2

In addition, there are a number of publicly accessible (not necessarily open) sources of metadata about large collections of books generally. The most notable are:

  • Descriptive Metadata: ISBN registries, including the service provided by OpenLibrary, can be used to obtain additional descriptive metadata for titles, including subject headings. The open isbntools package provides a standardized way of retrieving this data from a range of registries.

  • Citations: A limited number of monographs are assigned DOIs indexed in CrossRef, and open citation data is available through the I4OC initiative. Commercial citation services such as Google Scholar and Scopus also include some citation information for selected books. This information is challenging to access systematically, but small collections can be extracted using Harzing’s Publish or Perish tool.

  • Public domain works: A range of books, including some monographs, are now open by virtue of coming out of copyright and into the public domain. These are not listed in DOAB; however, APIs for HathiTrust and JSTOR provide descriptive metadata, rights metadata, and text-analytic metadata (e.g. ngrams) for the (open) books in their collections.

  • Prices: Amazon provides pricing APIs that can be applied to monograph titles, and a number of third parties track Amazon price histories. This data is available only under restrictive terms and in small quantities.

These sources can be merged using the ISBN as a key, but some care is required to standardize the identifiers, dates, and other fields.3 (This is illustrated in the code below.)
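
The identifier normalization mentioned above can be sketched in a few lines of base R. This is only an illustration of the idea, not the isbntools-based wrapper used in this analysis: the helper name normalize_isbn13 is hypothetical, and the sketch does not verify the check digit of values that already have 13 digits.

```r
# Hypothetical helper (not part of this analysis): normalize one ISBN string
# to its 13-digit form. Returns NA when the input does not look like an ISBN.
normalize_isbn13 <- function(isbn) {
  # keep digits and the ISBN-10 check character X; drop hyphens, spaces, labels
  cleaned <- toupper(gsub("[^0-9Xx]", "", isbn))
  if (grepl("^[0-9]{13}$", cleaned)) {
    return(cleaned) # already 13 digits; the check digit is NOT verified here
  }
  if (!grepl("^[0-9]{9}[0-9X]$", cleaned)) {
    return(NA_character_)
  }
  # ISBN-10 to ISBN-13: prefix 978, drop the old check digit, recompute it
  body <- paste0("978", substr(cleaned, 1, 9))
  digits <- as.integer(strsplit(body, "")[[1]])
  weights <- rep(c(1, 3), times = 6)
  check <- (10 - sum(digits * weights) %% 10) %% 10
  paste0(body, check)
}

isbns <- c("0-306-40615-2", "978-0-306-40615-7", "080442957X", "n/a")
normalized <- vapply(isbns, normalize_isbn13, character(1), USE.NAMES = FALSE)
print(normalized)
```

The first two inputs are the ISBN-10 and ISBN-13 forms of the same edition, so they collapse to a single key, which is what makes the ISBN usable for joins across sources.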

In the table below you can browse a sample of titles:

## interactive sample data table
library(DT)
doab_df %>% 
  ungroup() %>% slice_head(n = 1000) %>%
  datatable(class = "cell-border stripe", caption = "Sample of DOAB Catalog",
            options = list(pageLength = 5), extensions = "Responsive")

After browsing the DOAB sample for a short while, you will likely notice glitches. There are many, including missing fields; typos; undocumented and inconsistent formats for names, dates, and identifiers; and multiple values packed into a single field in undocumented and inconsistent ways. These ‘dirty data’ issues are not unique to DOAB and are, in fact, ubiquitous across the data sources we examined. For further data integration, at minimum, standardization of date and ISBN fields is required, as illustrated in the code below.
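
To make two of these issues concrete, here is a minimal base R sketch on invented records (not actual DOAB rows). It shows one way to unpack a field that holds several ISBNs and to pull a four-digit year out of inconsistently formatted dates. The column names and the year heuristic are assumptions made for illustration only.

```r
# Invented records imitating the inconsistencies described above
raw <- data.frame(
  title = c("A", "B", "C"),
  isbn = c("9780306406157 0306406152", "978-0-8044-2957-3", ""),
  added = c("2019-03-01 10:15:00", "2020-11-30", "12/01/2020"),
  stringsAsFactors = FALSE
)

# Unpack the ISBN field: strip everything except digits, X, and whitespace,
# then split on whitespace so each record gets a (possibly empty) vector
cleaned <- gsub("[^0-9Xx[:space:]]", "", raw$isbn)
raw$isbn_list <- lapply(
  strsplit(trimws(cleaned), "[[:space:]]+"),
  function(x) x[nzchar(x)]
)

# Standardize dates to a year: take the first 19xx/20xx run found, if any
m <- regexpr("(19|20)[0-9]{2}", raw$added)
raw$added_year <- NA_integer_
raw$added_year[m > 0] <- as.integer(regmatches(raw$added, m))

print(raw[, c("title", "added_year")])
print(raw$isbn_list)
```

The year heuristic is deliberately crude: it would misread a date whose day and month digits happen to form a plausible year, so a real pipeline should parse against known formats.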

### Data Cleaning 
## address basic issues with:
## - date standardization 
## - ISBN list packing
## - ISBN format standardization 
## - non-monograph entries

library(lubridate)

## DOAB  basic data cleaning 
doab_df %<>%
  filter(`Type` == "book") %>% 
  mutate(
    DT_PUBLISHED_YR = year(parse_date_time(`Year of publication`, "y")),
    DT_ADDED_YR = year(parse_date_time(`Added on date`, "ymd HMS")),
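    ## strip everything except digits, the ISBN-10 check character X, and whitespace,
    ## then split the packed field into a list of ISBNs
    LS_ID_ISBNS = str_split(
      str_replace_all(ISBN, "[^0-9Xx\\s]", ""), "\\s+")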
     )  %>%
  mutate(LS_ID_ISBNS =lapply(LS_ID_ISBNS,isbntools,meth="ean13"))

## plot the number of titles added to DOAB per year
library(plotly)
library(ggthemes)
tmp_plot <- doab_df %>%
  group_by(`DT_ADDED_YR`) %>%
  summarize(total = n()) %>%
  ggplot() +
  aes(x = `DT_ADDED_YR` , y = `total`) +
  geom_bar(stat = "identity") +
  geom_smooth() +
  scale_color_fivethirtyeight() +
  scale_x_continuous( breaks = c(2010,2012,2014,2016,2018,2020)) +
  theme_fivethirtyeight()
ggplotly(tmp_plot)

## estimate author gender from first names
library(gender)
library(genderdata)
# NOTE, must use devtools::install_github("ropensci/genderdata") for all methods to function
library(humaniformat)

doab_df %<>% mutate(LS_NM_AUTHORS=str_split(`Authors`,";"))

# parse_names fails on empty strings, so wrap the name helpers
# gender can fail with the genderize method, so wrap it too
safe_first_name <- possibly(first_name, otherwise="")
safe_format_reverse <- possibly(format_reverse, otherwise="")
safe_gender <- possibly(gender, otherwise=list(gender=""))

doab_df %<>%
  rowwise() %>%
  mutate(LS_NM_AUTHORS_R = list(safe_format_reverse(str_squish(`LS_NM_AUTHORS`))))

doab_df %<>%
  rowwise() %>%
  mutate(LS_NM_AUTHOR_FIRST = list(safe_first_name(`LS_NM_AUTHORS_R`)))

doab_df %<>%
  ungroup() %>%
  rowwise() %>%
  mutate(LS_CAT_GENDERS = list(
    safe_gender(`LS_NM_AUTHOR_FIRST`, method = "kantrowitz")[["gender"]]
  ))

## count authors per title by inferred gender
doab_df %<>%
  rowwise() %>%
  mutate(
    N_GENDER_MALE = sum(LS_CAT_GENDERS == "male", na.rm = TRUE),
    N_GENDER_FEMALE = sum(LS_CAT_GENDERS == "female", na.rm = TRUE)
  )

## interactive pivot table of titles by inferred author gender and year opened
library(rpivotTable)
doab_pivot_df <- doab_df %>% transmute(
  "Publisher" = `Publisher`,
  "Opened Year" = `DT_ADDED_YR`,
  "Female Authored" = `N_GENDER_FEMALE` > 0,
  "Number of Female Authors" = `N_GENDER_FEMALE`
)
doab_pivot_df %>%
  rpivotTable(
    rows = "Female Authored", cols = "Opened Year",
    vals = "Number of Female Authors",
    aggregatorName = "Count", rendererName = "Table Barchart"
  )

### When Monographs Become Open

3.1 Patterns

3.1.1 APCs

library(lubridate)
## oapc cleaning
oapc_df <- mono_load_oapc()
oapc_df %<>%
  mutate(
    DT_ADDED_YR = year(parse_date_time(`period`, "y")),
    ID_ISBN_PRINT = lapply(`isbn_print`, isbntools, meth="ean13"),
    ID_ISBN_MAIN = sapply(`isbn`, isbntools, meth="ean13"), 
    ID_DOI_ISBNA =  lapply(`isbn`, isbntools, meth="doi"), 
    )
oapc_df %<>% 
  rowwise()%>%
  mutate(LS_ID_ISBNS = list(
    setdiff(unique(c(`ID_ISBN_PRINT`,`ID_ISBN_MAIN`)), "")
            ))
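
Once both tables carry a list of normalized ISBNs, linking them amounts to expanding the lists and joining on the individual values. The sketch below shows this on invented data in base R. The helper explode_isbns, the toy tables, and the charge_eur column are hypothetical; only the LS_ID_ISBNS name is taken from the code above. The same result could be obtained with tidyr::unnest() and dplyr::inner_join().

```r
# Invented stand-ins for the cleaned DOAB and OpenAPC tables
doab_toy <- data.frame(title = c("Book A", "Book B"), stringsAsFactors = FALSE)
doab_toy$LS_ID_ISBNS <- list(c("9780306406157", "9780000000002"), "9780804429573")

oapc_toy <- data.frame(charge_eur = c(8000, 6500), stringsAsFactors = FALSE)
oapc_toy$LS_ID_ISBNS <- list("9780306406157", "9781111111113")

# Expand a list column of ISBNs into one row per (record, ISBN) pair
explode_isbns <- function(df) {
  n <- lengths(df$LS_ID_ISBNS)
  keep <- setdiff(names(df), "LS_ID_ISBNS")
  out <- df[rep(seq_len(nrow(df)), n), keep, drop = FALSE]
  out$isbn <- unlist(df$LS_ID_ISBNS)
  rownames(out) <- NULL
  out
}

# Inner join on any shared ISBN, then drop duplicates that arise when
# two records share more than one ISBN
linked <- merge(explode_isbns(doab_toy), explode_isbns(oapc_toy), by = "isbn")
linked <- unique(linked[, c("title", "charge_eur")])
print(linked)
```

Joining on any shared ISBN, rather than on a single designated one, reflects the fact that different sources may record different formats of the same work.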

4 Puzzles

The exploration above raises a number of questions. Under what conditions does the open availability of a monograph affect prices and sales? What are the mediating factors: does the length or subject of the monograph mediate sales effects? What are the potential mechanisms at play?

This exploration is limited by existing data. Each individual press has information on the sales, costs, and usage of the monographs it publishes. If pooled, this data could potentially answer deeper questions about the economics and utility of academic monographs, and could guide a transition to open access models.

5 About this Document

This is a reproducible document. The most straightforward way to examine and modify the source is to clone the repository using git and then load the project in RStudio. The source is available here, and follows tidyverse style guidelines (using styler and lintr for conformance checking).

This analysis relies primarily on the R language, with Python for the isbntools library. We make extensive use of the Plot.ly graphics package and open R libraries (especially tidyverse, gender, htmlwidgets, and crosstalk), as well as Baker’s R Makefiles.

All references in this document are managed in Zotero.

The authors describe contributions to this essay using a standard taxonomy (see [@allen2014]). Micah Altman provided the core formulation of the essay’s goals and aims, and led the writing, methodology, data curation, and visualization. Chris Bourg and Sue Kriegsman contributed to conceptualization and provided review. CREOS research assistant Shelley Choi provided assistance with preliminary data visualization and software implementation.

This work is Copyright 2020 Micah Altman, and is Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This work was conducted with support from the Center for Research on Equitable and Open Scholarship at the Massachusetts Institute of Technology.

References

Adema, Janneke, Graham Stone, and Chris Keene. n.d. “Changing Publishing Ecologies: A Landscape Study of New University Presses and Academic-Led Publishing: A Report to JISC,” 103. http://repository.jisc.ac.uk/6666/1/Changing-publishing-ecologies-report.pdf.
Crossick, Geoffrey. 2016. “Monographs and Open Access.” Insights the UKSG Journal 29 (1): 14–19. https://doi.org/10.1629/uksg.280.
Crow, Raym. n.d. “A Rational System for Funding Scholarly Monographs: A White Paper Prepared for the AAU-ARL Task Force on Scholarly Communication,” 34. https://www.arl.org/wp-content/uploads/2012/11/aau-arl-white-paper-rational-system-for-funding-scholarly-monographs-2012.pdf.
Eve, Martin Paul. 2014. Open Access and the Humanities: Contexts, Controversies and the Future. Cambridge, United Kingdom: Cambridge University Press. https://www.cambridge.org/core/services/aop-cambridge-core/content/view/02BD7DB4A5172A864C432DBFD86E5FB4/9781107097896AR.pdf/Open_Access_and_the_Humanities.pdf?event-type=FTLA.
Grimme, Sara, Mike Taylor, Michael A. Elliott, Cathy Holland, Peter Potter, and Charles Watkinson. 2019. “The State of Open Monographs.” https://digitalscience.figshare.com/articles/The_State_of_Open_Monographs/8197625.
Guédon, Jean-Claude. 2019. Future of Scholarly Publishing and Scholarly Communication: Report of the Expert Group to the European Commission. https://doi.org/10.2777/836532.
Penier, Izabella, Martin Paul Eve, and Tom Grady. 2020. “COPIM Revenue Models for Open Access Monographs 2020.” https://doi.org/10.5281/ZENODO.4011836.
Science Europe. n.d. “Briefing Paper on Open Access to Academic Books.” https://www.ouvrirlascience.fr/wp-content/uploads/2019/10/SE_On-Open-Access-to-Academic-Books_092019.pdf.
Spence, Paul. 2018. “The Academic Book and Its Digital Dilemmas.” Convergence: The International Journal of Research into New Media Technologies 24 (5): 458–76. https://doi.org/10.1177/1354856518772029.

  1. The source for the document is available here. Since this blog takes the form of a fully replicable analysis, new versions may be released as the data sources it relies on are updated.↩︎

  2. Monographs are typically uniquely identified through an ISBN, which is also a common choice of key when linking across databases. However, each ISBN is associated with a specific format (e.g. paper, hardcover, digital), so a single work published in multiple formats will have multiple ISBNs. Further, the same ISBN may be expressed in multiple forms, so normalization is essential (isbntools is useful for this). Finally, some databases use a DOI (digital object identifier) or ASIN (Amazon standard identification number) instead of an ISBN. Generally, the correspondence across identifiers must be resolved using an index. There is a partial mapping between ASINs and ISBNs: ASINs for printed works generally match the ISBN, but Kindle editions (and related digital works) are assigned new ASINs.↩︎

  3. Monographs are typically uniquely identified through an ISBN, which is also a common choice of key when linking across databases. However, each ISBN is associated with a specific format (e.g. paper, hardcover, digital), so a single work published in multiple formats will have multiple ISBNs. Further, the same ISBN may be expressed in multiple forms, so normalization is essential (isbntools is useful for this). Finally, some databases use a DOI (digital object identifier) or ASIN (Amazon standard identification number) instead of an ISBN. Generally, the correspondence across identifiers must be resolved using an index. There is a partial mapping between ASINs and ISBNs: ASINs for printed works generally match the ISBN, but Kindle editions (and related digital works) are assigned new ASINs.↩︎